A Description of Texts in a Corpus: 'Virtual' and 'Real' Corpora
Abstract
The extensive use of computer-based corpora for a range of language studies has led to a proliferation of the ways in which texts within an individual corpus are organised. Typically, the organisation reflects the immediate needs of a group of well-motivated users, such as lexicographers or terminologists. As a result, subsequent generations of corpus users are forced to work with a classification of texts based on categories they may not be familiar with, may not be comfortable with, or both. There is an urgent need for a facility in corpus management systems that allows users to apply their own classification system when categorising the texts in a corpus. That is, users should be able to choose, for example, their own style, register, field, time span and author attributes when generating word lists, concordances, contextual examples and so on. A lexicography/terminology management system, System Quirk, is described that can support such a virtual organisation of texts within a corpus.
Introduction
There are open questions in corpus linguistics about how texts should be selected and, perhaps more importantly, for what purpose. Some argue that lexicographers and linguists should choose the texts themselves, with some advice from teachers of English (Sinclair and colleagues in Sinclair 1987), whilst the pioneers of corpus linguistics used a random-selection approach (cf. the Lancaster-Oslo/Bergen (LOB) Corpus and the Brown Corpus). Still others have argued that there should be an equal mixture of deliberately selected and randomly selected text (see, for instance, Summers 1991). We hope that the discussion of how text is organised and, indeed, of how representative text is chosen, will motivate the reader to consider the various parameters that can label a text.
These parameters may include the medium in which the text is delivered (books, magazines, journals, leaflets, letters) and the genre of the text: fiction or non-fiction, and whether it is imaginative or informative, persuasive or instructional. The register and the domain of the text are equally important parameters. Furthermore, there are some atomic features of a text, including the author's age and sex, the publication period, the language variety and so on. (One might consider using the 'contextual correlates' described by Halliday to categorise texts in terms of their tenor and field, given that the mode of the language in the text corpora is textual.) The LOB corpus was categorised into informative texts and imaginative texts. The latter category contains mainly works of fiction, ranging from detective fiction to
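The idea of a 'virtual' corpus selected by user-chosen parameters can be illustrated with a minimal Python sketch. It is not taken from System Quirk: the `Text` record, its field names, the `virtual_corpus` and `word_list` helpers and the three sample texts are all hypothetical, chosen only to mirror the parameters listed above (medium, genre, register, domain, publication period).

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Text:
    """One corpus text plus descriptive parameters (hypothetical schema)."""
    title: str
    body: str
    medium: str    # e.g. "book", "journal", "leaflet"
    genre: str     # e.g. "fiction", "non-fiction"
    register: str
    domain: str
    year: int      # publication period, reduced to a single year here


def virtual_corpus(texts, **criteria):
    """Return the texts whose attributes satisfy every user-supplied criterion.

    A criterion is either a plain value (tested for equality) or a
    one-argument predicate (useful for ranges such as a time span).
    """
    def matches(text):
        for name, wanted in criteria.items():
            actual = getattr(text, name)
            ok = wanted(actual) if callable(wanted) else actual == wanted
            if not ok:
                return False
        return True

    return [t for t in texts if matches(t)]


def word_list(texts):
    """Frequency list over a selection of texts (naive whitespace tokens)."""
    counts = Counter()
    for t in texts:
        counts.update(t.body.lower().split())
    return counts


# Invented sample data, for illustration only.
texts = [
    Text("A", "the engine drives the pump", "journal", "non-fiction",
         "technical", "engineering", 1994),
    Text("B", "the detective opened the door", "book", "fiction",
         "literary", "crime", 1961),
    Text("C", "the pump needs a valve", "leaflet", "non-fiction",
         "technical", "engineering", 1988),
]

# Two different users impose two different organisations on the same texts.
recent_engineering = virtual_corpus(
    texts, domain="engineering", year=lambda y: y >= 1990
)
fiction = virtual_corpus(texts, genre="fiction")
```

In this sketch the stored texts are never regrouped. Each user's classification is just a query over the metadata, and a word list (or, by extension, a concordance) is generated from whatever selection that query returns.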